Welcome to our workshop on inferential modelling with wide data. We hope you enjoy the session.
This workshop will cover the problems associated with inferential modelling of high-dimensional, wide data, suggest approaches to overcome them, and provide hands-on training in the implementation of regularisation techniques, covariate selection stability and triangulation.
The following packages are required for these exercises.
library(Hmisc)
library(broom)
library(glmnet)
library(ncvreg)
library(bigstep)
library(rsample)
library(tidyverse)
library(stabiliser)
In order to appreciate the issues we will be discussing today, we have provided functions to simulate datasets for exploration.
The following function generates a dataset with “ncols” as the number of variables and “nrows” as the number of rows.
generate_uncor_variables <- function(ncols, nrows) {
  as_tibble(as.data.frame(replicate(ncols, rnorm(nrows, 0, 1))))
}
A dataset with 197 rows and 130 variables can then be generated using this function as follows:
variables <- generate_uncor_variables(ncols = 130, nrows = 197)
This results in the following dataset being generated:
variables
We can also generate an outcome variable, in this case randomly generated in the same manner, but renamed as “outcome”:
generate_uncor_outcome <- function(nrows) {
  as_tibble(as.data.frame(replicate(1, rnorm(nrows, 0, 1)))) %>%
    rename("outcome" = 1)
}
outcome <- generate_uncor_outcome(nrows = 197)
outcome
We can now bind together the uncorrelated, randomly generated variables, with the randomly generated outcome.
df_no_signal <- outcome %>%
  bind_cols(variables)
This results in a dataset of 197 rows, with a single outcome variable that has no relationship to the 130 explanatory variables, as shown below.
df_no_signal
The following function conducts a univariable analysis to determine the association between a given variable and the outcome. A Pearson or Spearman rank correlation matrix is another option for this.
univariable_analysis <- function(data, variable) {
  data %>%
    lm(outcome ~ variable, .) %>%
    tidy() %>%
    filter(term != "(Intercept)")
}
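As a minimal sketch of the correlation-matrix alternative mentioned above, Hmisc::rcorr() (Pearson by default; pass type = "spearman" for rank correlation) can screen every variable against the outcome in one call. The object name cor_results is illustrative and not used later in the workshop.

```r
# Illustrative alternative to the per-variable lm() screen:
# a correlation matrix of every variable against the outcome.
cor_screen <- Hmisc::rcorr(as.matrix(df_no_signal))

cor_results <- tibble(
  variable = colnames(df_no_signal),
  r = cor_screen$r[, "outcome"],       # correlation with the outcome
  p.value = cor_screen$P[, "outcome"]  # corresponding p-value
) %>%
  filter(variable != "outcome")
```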
This function can then be applied to each column of the dataset individually using map_df(), returning a single dataframe.
univariable_outcomes <- map_df(df_no_signal, ~ univariable_analysis(data = df_no_signal, variable = .), .id = "variable")
A conventional approach would then filter at a given threshold (for example P<0.2).
univariable_outcomes_filtered <- univariable_outcomes %>%
  filter(p.value < 0.2)
This results in the table below of all variables with a p-value < 0.2, to be carried forward into a multivariable model.
univariable_outcomes_filtered
A list of variables to be included is as follows:
variables_for_stepwise <- univariable_outcomes_filtered %>%
  pull(variable)
These variables would subsequently be offered into a stepwise selection process such as the following:
stepwise_model <- function(data, variables) {
  data_selected <- data %>%
    select(all_of(variables))
  lm(outcome ~ ., data = data_selected) %>%
    step(direction = "backward", trace = FALSE) %>%
    tidy() %>%
    filter(p.value < 0.05) %>%
    rename(variable = term)
}
prefiltration_results <- stepwise_model(data = df_no_signal, variables = variables_for_stepwise)
prefiltration_results
We will test a variety of models on this dataset. For future comparison, let’s set up a list where we can store model results.
model_results <- list()
generate_data_with_signal <- function(nrow, ncol, n_causal_vars, amplitude) {
  # Generate the variables from a multivariate normal distribution
  # with a Toeplitz (autoregressive-style) correlation structure
  rho <- 0.25
  sigma <- toeplitz(rho^(0:(ncol - 1))) # symmetric Toeplitz correlation matrix
  X <- matrix(rnorm(nrow * ncol), nrow) %*% chol(sigma) # impose the correlation via its Cholesky factor
  # Generate the response from a linear model
  nonzero <- sample(ncol, n_causal_vars) # select the ids of the 'true' variables
  beta <- amplitude * (1:ncol %in% nonzero) / sqrt(nrow) # vector of effect sizes picking out the true variables
  outcome <- X %*% beta + rnorm(nrow) # outcome from the true variables plus random error
  # Rename the true variables so they can be identified later
  X_data <- as.data.frame(X)
  for (i in nonzero) {
    X_data <- X_data %>%
      rename_with(.cols = all_of(i), ~ paste0("causal_", i))
  }
  as.data.frame(cbind(outcome, X_data))
}
We can now simulate a dataset df_signal with 300 rows and 300 variables, 8 of which have a relationship with the outcome.
We can also alter the signal strength of the causal variables by changing the “amplitude” parameter.
df_signal <- generate_data_with_signal(nrow = 300, ncol = 300, n_causal_vars = 8, amplitude = 7)
As we have simulated this dataset, we can “cheat” and create the perfect model to check everything’s worked correctly.
df_signal %>%
  select(outcome, contains("causal_")) %>%
  lm(outcome ~ ., data = .) %>%
  tidy()
We can now repeat our prefiltration and stepwise selection approach as before.
univariable_outcomes <- map_df(df_signal, ~ univariable_analysis(data = df_signal, variable = .), .id = "variable")
univariable_outcomes_filtered <- univariable_outcomes %>%
  filter(p.value < 0.2)
variables_for_stepwise <- univariable_outcomes_filtered %>%
  pull(variable)
model_results$prefiltration <- stepwise_model(data = df_signal, variables = variables_for_stepwise)
model_results$prefiltration
The lasso can also be used:
model_lasso <- function(data) {
  y_temp <- data %>%
    select("outcome") %>%
    as.matrix()
  x_temp <- data %>%
    select(-"outcome") %>%
    as.matrix()
  fit_lasso <- cv.glmnet(x = x_temp, y = y_temp, alpha = 1)
  coefs <- coef(fit_lasso, s = "lambda.min")
  data.frame(name = coefs@Dimnames[[1]][coefs@i + 1], coefficient = coefs@x) %>%
    rename(
      variable = name,
      estimate = coefficient
    ) %>%
    filter(variable != "(Intercept)") %>%
    select(variable, estimate)
}
model_results$lasso <- model_lasso(df_signal)
MCP (the minimax concave penalty) can also be used:
model_mcp <- function(data) {
  y_temp <- data %>%
    select("outcome") %>%
    as.matrix()
  x_temp <- data %>%
    select(-"outcome") %>%
    as.matrix()
  fit_mcp <- cv.ncvreg(X = x_temp, y = y_temp)
  fit_mcp %>%
    coef() %>%
    enframe(name = "variable", value = "estimate") %>%
    filter(
      variable != "(Intercept)",
      estimate != 0
    ) %>%
    mutate(variable = str_remove_all(variable, "`"))
}
model_results$mcp <- model_mcp(df_signal)
mBIC (the modified Bayesian information criterion), via the bigstep package, can also be used:
model_mbic <- function(data) {
  y_temp <- data %>%
    select("outcome") %>%
    as.matrix()
  x_temp <- data %>%
    select(-"outcome")
  bigstep_prepped <- bigstep::prepare_data(y_temp, x_temp, verbose = FALSE)
  bigstep_prepped %>%
    reduce_matrix(minpv = 0.01) %>%
    fast_forward(crit = mbic) %>%
    multi_backward(crit = mbic) %>%
    summary() %>%
    stats::coef() %>%
    as.data.frame() %>%
    rownames_to_column(var = "variable") %>%
    mutate(variable = str_remove_all(variable, "`")) %>%
    filter(
      !grepl("(Intercept)", variable),
      !grepl("Xm[, -1]", variable)
    ) %>%
    rename(estimate = Estimate) %>%
    select(variable, estimate)
}
model_results$mbic <- model_mbic(df_signal)
A comparison of the number of true/false positives is shown below:
calculate_tp_fp <- function(results) {
  results %>%
    mutate(causal = case_when(
      grepl("causal", variable) ~ "tp",
      !grepl("causal", variable) ~ "fp"
    )) %>%
    group_by(model) %>%
    summarise(
      tp = sum(causal == "tp", na.rm = TRUE),
      fp = sum(causal == "fp", na.rm = TRUE)
    ) %>%
    mutate(total_selected = tp + fp)
}
conventional_results <- model_results %>%
  bind_rows(.id = "model") %>%
  calculate_tp_fp()
A function for bootstrapping:
boot_sample <- function(data, boot_reps) {
  rsample::bootstraps(data, times = boot_reps)
}
bootstrapped_datasets <- boot_sample(data = df_signal, boot_reps = 10)
Bootstrapped data is presented here as a table of 10 different nested tables.
bootstrapped_datasets
If we extract a single bootstrapped dataset and sort by the outcome, we can see that several rows have been resampled. Consequently, as the dataset is the same length as the original, several rows will be omitted completely.
bootstrapped_datasets$splits[[1]] %>%
  as_tibble() %>%
  arrange(outcome)
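We can quantify this: counting the distinct rows in the first resample shows how many original rows were drawn at least once (in expectation roughly 63%, i.e. 1 − 1/e, of rows appear in any given bootstrap sample, with the rest omitted entirely).

```r
# Number of distinct original rows appearing in the first bootstrap resample;
# the remaining rows were omitted from this resample entirely.
bootstrapped_datasets$splits[[1]] %>%
  as_tibble() %>%
  distinct() %>%
  nrow()
```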
We can apply our previous lasso function over each one of these bootstrapped resamples.
model_lasso_bootstrapped <- bootstrapped_datasets %>%
  pull(splits) %>%
  map_df(~ as.data.frame(.x) %>% model_lasso(), .id = "bootstrap")
To identify a null threshold, first we must permute the outcome.
Our original dataset looks like this:
df_signal
By permuting the outcome variable we sever all ties between the outcome and the explanatory variables. We might want to conduct this 5 times.
permuted_datasets <- rsample::permutations(data = df_signal, permute = outcome, times = 5)
A single dataset would look like this. Note that the structure of the explanatory variables remains the same, but the outcome is randomly shuffled.
permuted_datasets$splits[[1]] %>%
  as_tibble()
We can then apply our bootstrap function to each one of these 5 permuted datasets. We might perform 10 bootstrap samples for each of the 5 permuted datasets. The model would then be applied to each dataset within the following table.
permuted_bootstrapped_datasets <- permuted_datasets %>%
  pull(splits) %>%
  map_df(~ as.data.frame(.x) %>% boot_sample(boot_reps = 10), .id = "permutation")
permuted_bootstrapped_datasets
This code is relatively lengthy, and is therefore deliberately omitted from the workshop; however, it is present within the stabiliser package and freely available.
Using the ecdf() function, the stability of each variable within each permutation can be calculated.
By choosing the value where quantile(., probs = 1), the highest stability that might have occurred by chance can be identified.
The mean threshold across all permutations can then be calculated. This represents the “null threshold”; i.e., the mean highest stability a variable might achieve across all permuted datasets (where we know there should be no links between the variables and the outcome).
Variables in the true model (i.e., non-permuted data) that have a stability value in excess of this null threshold are highly likely to be truly correlated with the outcome.
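As a rough sketch of that calculation (the full implementation lives in the stabiliser package): assume a hypothetical data frame permuted_stability with one row per permutation–variable pair, and a stability column holding each variable's selection proportion across bootstraps. The null threshold could then be computed as:

```r
# Hypothetical input: permuted_stability has columns
#   permutation (id), variable, stability (% of bootstraps selecting the variable)
null_threshold <- permuted_stability %>%
  group_by(permutation) %>%
  summarise(max_stability = quantile(stability, probs = 1)) %>% # highest stability per permutation
  pull(max_stability) %>%
  mean() # mean of the per-permutation maxima = null threshold
```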
stab_output <- stabilise(outcome = "outcome", data = df_signal, models = c("mbic"), type = "linear")
stab_output$mbic$stability
The stabiliser package allows multiple models to be run simultaneously. Just select the models you wish to run in the “models” argument.
MCP is omitted here for speed. To include it, just add it to the list of models using: models = c("mbic", "lasso", "mcp")
stab_output <- stabilise(outcome = "outcome", data = df_signal, models = c("mbic", "lasso"), type = "linear")
Calculate the number of true and false positives selected through the stability approaches, and rename the columns to include “_stability”:
stability_results <- stab_output %>%
  map_df(~ .x$stability, .id = "model") %>%
  filter(stable == "*") %>%
  calculate_tp_fp() %>%
  rename_all(~ paste0(., "_stability"))
stability_results
Compare this with the non-stability approach:
conventional_results %>%
  left_join(stability_results, by = c("model" = "model_stability"))
The stabiliser package allows the stability selection results from multiple models to be used synergistically; by leveraging the strengths of the various models, a more robust method of variable selection is often achieved.
triangulated_stability <- triangulate(stab_output)
triangulated_stability
## $combi
## $combi$stability
## # A tibble: 301 x 4
## variable stability bootstrap_p stable
## <chr> <dbl> <dbl> <chr>
## 1 causal_178 100 0 *
## 2 causal_257 100 0 *
## 3 causal_98 100 0 *
## 4 causal_234 99.5 0 *
## 5 causal_236 99.5 0 *
## 6 causal_199 98.5 0 *
## 7 causal_237 98 0 *
## 8 causal_282 96.5 0 *
## 9 V266 46 0 <NA>
## 10 V78 46 0 <NA>
## # ... with 291 more rows
##
## $combi$perm_thresh
## [1] 55
stab_plot(triangulated_stability)
## $combi
We can now return to our original dataset that we simulated to have no signal.
Our conventional approach performed relatively poorly, selecting the following variables as being significantly associated with the outcome variable.
prefiltration_results
Using stabiliser, the following variables are selected from the dataset.
stab_output_no_signal <- stabilise(outcome = "outcome", data = df_no_signal, models = c("mbic", "lasso"), type = "linear")
triangulated_output_no_signal <- triangulate(stab_output_no_signal)
triangulated_output_no_signal$combi$stability %>%
  filter(stable == "*")
Thank you for attending this workshop. We hope you enjoyed the session, and now have a good understanding of where some conventional modelling approaches may not be appropriate in wide datasets.
If you have any further questions after the workshop, please feel free to contact Martin Green (martin.green@nottingham.ac.uk) or Robert Hyde (robert.hyde4@nottingham.ac.uk).